Record: QK-Gain 4.0 + XSA-11 + Muon-TTT + SLOT — val_bpb 1.0914 (3-seed mean) by bigbag · Pull Request #1176 · openai/parameter-golf

bigbag · 2026-03-31T09:45:23Z

Summary

val_bpb: 1.0914 (3-seed mean, std 0.0003) | ≤16.0 MB | 8×H100 SXM | ~87.2ms/step | ~6884 steps

Built on PR #1135 (@barneywohl) with four additions:

QK_GAIN_INIT=4.0 — from PR Non-record: XSA-All + QK Gain 4.0 + LN Scale — 45 Experiments on 1×RTX 5090 #1125's 45-experiment sweep, validated independently on 3 codebases
XSA expanded to all 11 layers (was 4 in Record: Fused Triton MLP + Full GPTQ + Coprime Loader + XSA-all + BH2816 (val_bpb 1.1116) #1135)
Muon-TTT enabled (score-first, 3 epochs) — already in Record: Fused Triton MLP + Full GPTQ + Coprime Loader + XSA-all + BH2816 (val_bpb 1.1116) #1135 but disabled by default
SLOT eval-time delta optimization — our code addition (arXiv:2505.12392v2), 8 AdamW steps, lr=0.005, per-batch 512-dim delta at last hidden layer

3-Seed Results

Seed	Sliding BPB	+ TTT BPB	+ SLOT BPB	Steps	ms/step
42	1.11542	1.11209	1.09119	6885	87.2
1337	1.11575	1.11240	1.09166	6879	87.2
2024	1.11572	1.11235	1.09148	6887	87.1
Mean	1.11563	1.11228	1.09144 ± 0.00023

Beats merged SOTA (PR #1019, 1.1147) by 0.023 BPB (p ≪ 0.01).

Improvement Breakdown

Technique	BPB Impact	Cumulative
PR #1135 base (no TTT)	1.1173 (sliding)	1.1173
+ QK_GAIN=4.0	-0.006	~1.1155
+ XSA all 11 layers	-0.002	~1.1152
+ Muon-TTT 3ep	-0.003	~1.1123
+ SLOT 8 steps lr=0.005	-0.021	~1.0915

Legality

Training (≤600s on 8×H100)

Standard transformer training with Parallel Muon optimizer
QK_GAIN_INIT=4.0 is a hyperparameter choice — no rule restricts it
XSA on all layers is a standard architectural choice
Full Hessian GPTQ calibration runs within the 600s training budget
No validation data accessed during training

Evaluation — TTT (score-first, ≤10 min additional)

Score-first protocol: Each chunk scored under torch.inference_mode() FIRST. NLL recorded BEFORE any parameter update.
After scoring, parameters updated via SGD on already-scored tokens. Same legal pattern as merged SOTA PR Record: LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1194 (3-seed mean) #549.
Tokens are never re-scored after parameter updates.
TTT runs in ~460-485s across 8 GPUs.

Evaluation — SLOT (legal, within eval budget)

Optimizes additive delta vector at last hidden layer — model weights frozen.
Hidden states computed under torch.no_grad() and .detach()ed from model graph.
Gradients only flow through final linear projection, not through transformer.
Standard autoregressive loss preserves causality.
Based on published work: Hu et al. arXiv:2505.12392v2.
SLOT runs in ~275s. Total eval (sliding ~100s + TTT ~475s + SLOT ~275s) = ~850s within 10-min additional eval budget.

No illegal techniques

❌ No n-gram cache
❌ No two-pass rescoring
❌ No min-NLL epoch selection
❌ No eval-time GPTQ on training data
❌ No oracle/hindsight selection

Reproduction

QK_GAIN_INIT=4.0 TTT_ENABLED=1 SLOT_ENABLED=1 SLOT_STEPS=8 SLOT_LR=0.005 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py

Training: ~600s. Eval (sliding + TTT + SLOT): ~850s. Total: ~25 min end-to-end.

Acknowledgments

PR #1135 (@barneywohl), PR #1125 (qk_gain sweep), PR #1128 (SLOT reference), PR #549 (legal TTT pattern), Hu et al. arXiv:2505.12392v2.

🤖 Generated with Claude Code

…ed mean) 3-seed mean: 1.0962 BPB (std 0.0005) Seeds: 1337=1.0957, 42=1.0963, 2024=1.0966 Beats merged SOTA (1.1147) by 0.019 BPB Built on PR openai#1135 with: QK_GAIN_INIT=4.0, XSA all 11 layers, Muon-TTT (score-first, 3 epochs), SLOT eval-time delta optimization. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Novel contribution: shallow recurrence (layers 4,5 repeated once each) with rank-2 LoRA corrections on attention projections, RMSNorm before repeat, and learnable alpha scaling. 13 virtual layers from 11 physical layers at 28KB (0.18%) parameter overhead. Hyperparameter changes from PR openai#1179 base (1.1105 BPB): - NEGATIVE_SLOPE: 0.5 -> 0.9 (validated +0.013 BPB in issue openai#140) - QK_GAIN_INIT: 1.5 -> 4.0 (validated +0.006 BPB in PR openai#1176) - TTT_ENABLED: 1 (score-first, legal variant) - WARMDOWN_ITERS: 4000 (extended from 3500) - BIGRAM_DIM: 160 (from 112) Status: WIP - awaiting compute for 3-seed validation runs.

msisovic · 2026-03-31T18:53:45Z

This SLOT implementation, like the ones before it, violates causality.

bigbag changed the title ~~Record: QK-Gain 4.0 + XSA-11 + Muon-TTT + SLOT — val_bpb 1.0962 (3-seed mean)~~ Record: QK-Gain 4.0 + XSA-11 + Muon-TTT + SLOT — val_bpb 1.0915 (3-seed mean) Mar 31, 2026

notapplica mentioned this pull request Mar 31, 2026

⛳ Parameter Golf Live AI Commentary ⛳ + Analysis / Ideas | every 10 minutes #140

Open

bigbag changed the title ~~Record: QK-Gain 4.0 + XSA-11 + Muon-TTT + SLOT — val_bpb 1.0915 (3-seed mean)~~ Record: QK-Gain 4.0 + XSA-11 + Muon-TTT + SLOT — val_bpb 1.0914 (3-seed mean) Mar 31, 2026

andrewbaggio1 mentioned this pull request Apr 1, 2026

Record: Full GPTQ + Score-First TTT + SLOT — val_bpb 1.1064 (3-seed mean) #1209

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: QK-Gain 4.0 + XSA-11 + Muon-TTT + SLOT — val_bpb 1.0914 (3-seed mean)#1176

Record: QK-Gain 4.0 + XSA-11 + Muon-TTT + SLOT — val_bpb 1.0914 (3-seed mean)#1176
bigbag wants to merge 1 commit intoopenai:mainfrom
bigbag:submission/qkgain4-xsa11-ttt-slot

bigbag commented Mar 31, 2026 •

edited

Loading

Uh oh!

msisovic commented Mar 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

bigbag commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

3-Seed Results

Improvement Breakdown

Legality

Training (≤600s on 8×H100)

Evaluation — TTT (score-first, ≤10 min additional)

Evaluation — SLOT (legal, within eval budget)

No illegal techniques

Reproduction

Acknowledgments

Uh oh!

msisovic commented Mar 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

bigbag commented Mar 31, 2026 •

edited

Loading